Red Wine Quality Analysis by Michael Eckstein

The data set consists of red wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).

Univariate Plots Section

First, I’ll generate a summary of the data set to determine the sample size (1599), the number of columns (13, including the row index X), and statistics for each variable.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
## [1] 1599
## [1] 1599   13
## alcohol :  num [1:1599] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## chlorides :  num [1:1599] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## citric.acid :  num [1:1599] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## density :  num [1:1599] 0.998 0.997 0.997 0.998 0.998 ...
## fixed.acidity :  num [1:1599] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## free.sulfur.dioxide :  num [1:1599] 11 25 15 17 11 13 15 15 9 17 ...
## pH :  num [1:1599] 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## quality :  int [1:1599] 5 5 5 6 5 5 5 7 7 5 ...
## residual.sugar :  num [1:1599] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## sulphates :  num [1:1599] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## total.sulfur.dioxide :  num [1:1599] 34 67 54 60 34 40 59 21 18 102 ...
## volatile.acidity :  num [1:1599] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## X :  int [1:1599] 1 2 3 4 5 6 7 8 9 10 ...
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
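
The calls behind the listing above are not shown in this report. A minimal sketch of how such a summary could be produced in base R follows. It assumes the data sits in a data frame called redwine (the name used later in this report). The ten rows are the first values from the listing above and stand in for the full file only so that the sketch runs on its own.

```r
# First ten rows, copied from the str-style listing above (illustration only;
# the full data set has 1599 rows and would normally be read from a file).
redwine <- data.frame(
  X = 1:10,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

names(redwine)    # column names
summary(redwine)  # min, quartiles, median, mean, and max per column
nrow(redwine)     # sample size
dim(redwine)      # number of rows and columns
str(redwine)      # type and first values of each column

# Number of wines per quality rating
quality_counts <- table(redwine$quality)
quality_counts
```

On the full data set, the last call would give the per-rating counts shown at the end of the listing.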

Plotting the number of wines at each quality rating shows that the majority of wines in the data set are rated 5 or 6. This could affect the validity of the analysis because the other ratings have much smaller sample sizes. The plot is a graphical representation of the table above.

Next, I’ll explore the distribution of each feature of the data set.

I plotted each feature to view the distribution and determine if any skewness may need to be corrected within the data set. Looking at the graphs, several of the features can be transformed to help alleviate some of the long tails.

The plots of the features using log10 (residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, and sulphates) are much closer to a normal distribution, which may help with some of our statistical models later on.
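
One way to check the effect of the transform numerically is to compare skewness before and after taking log10. This is a sketch only: the skewness helper is defined here for illustration and is not part of the original analysis, and the ten residual.sugar values are the first ones from the listing in the Univariate Plots Section.

```r
# Moment-based sample skewness; defined here to avoid an extra package.
# Positive values indicate a long right tail.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# First ten residual.sugar values from the listing (illustration only)
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

skew_raw <- skewness(residual_sugar)
skew_log <- skewness(log10(residual_sugar))

# Compare the two; the log10 version should be closer to 0
c(raw = skew_raw, log10 = skew_log)
```

The same comparison can be repeated for chlorides, free.sulfur.dioxide, total.sulfur.dioxide, and sulphates on the full data set.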

Univariate Analysis

What is the structure of your dataset?

There are 1599 wines in the dataset with 12 features (residual.sugar, density, quality, fixed.acidity, chlorides, pH, volatile.acidity, free.sulfur.dioxide, sulphates, citric.acid, total.sulfur.dioxide, alcohol). None of the variables is an ordered factor; all are numeric or integer values.

Other observations:
- The median quality is 6.0, ranging from a min of 3 to a max of 8 on a scale of 0-10.
- The quality ratings have the following number of samples (rating: count): 3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18.
- The alcohol content of the red wine ranges between 8.4% and 14.9%, with 75% of the red wines at or below 11.1%.

What is/are the main feature(s) of interest in your dataset?

The main feature of the data set is quality. I’d like to determine which features have the greatest impact on the quality of red wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Alcohol, fixed.acidity, volatile.acidity, citric.acid, chlorides, total.sulfur.dioxide, density, and sulphates are likely to contribute to the quality of red wine.

Did you create any new variables from existing variables in the dataset?

Yes, I factored the quality and created a new variable called qualityfactor to help differentiate the different ratings when plotting.
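
A minimal sketch of that step, assuming the data frame is called redwine. The ten quality values are the first ones from the listing in the Univariate Plots Section and are used only so that the sketch runs on its own.

```r
# First ten quality ratings from the listing (illustration only)
redwine <- data.frame(quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L))

# Convert the integer rating into a factor so that plots treat each
# rating as its own group instead of a continuous number
redwine$qualityfactor <- factor(redwine$quality)

levels(redwine$qualityfactor)  # one level per rating present in the data
table(redwine$qualityfactor)   # number of wines per level
```

The sketch uses an unordered factor, which matches the note above that none of the variables is an ordered factor. Passing ordered = TRUE to factor() would be an alternative if the ordering of the ratings needs to be carried along.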

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Fixed.acidity, volatile.acidity, density, pH, alcohol, and quality are close to normal distributions. Residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, and sulphates are skewed to the right (long right tails) in their distribution. Citric.acid is somewhat evenly distributed, but appears to have a lot of values at 0 (132 total). It also appears that a few of the features, such as residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, and sulphates, have outliers that could impact the analysis. I log transformed the right-skewed distributions.
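
The zero count and the outliers can be checked with a few base R calls. This is a sketch under stated assumptions: the ten values per feature are the first ones from the listing in the Univariate Plots Section, and the outlier rule is the boxplot convention (points more than 1.5 times the hinge spread beyond the hinges) as implemented by boxplot.stats, which may differ from how outliers were judged by eye in the plots.

```r
# First ten values of each feature from the listing (illustration only)
citric_acid    <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Count the wines with a citric.acid value of exactly 0
n_zero <- sum(citric_acid == 0)
n_zero

# Values that a boxplot would draw as individual outlier points
outliers <- boxplot.stats(residual_sugar)$out
outliers
```

On the full data set, the first count should reproduce the 132 zeros mentioned above.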

Bivariate Plots Section

In this section, I’ll explore the relationship between two features to determine the impact on quality. First I’ll look at the correlation between the features.

##                                 X fixed.acidity volatile.acidity
## X                     1.000000000   -0.26848392     -0.008815099
## fixed.acidity        -0.268483920    1.00000000     -0.256130895
## volatile.acidity     -0.008815099   -0.25613089      1.000000000
## citric.acid          -0.153551355    0.67170343     -0.552495685
## residual.sugar       -0.031260835    0.11477672      0.001917882
## chlorides            -0.119868519    0.09370519      0.061297772
## free.sulfur.dioxide   0.090479643   -0.15379419     -0.010503827
## total.sulfur.dioxide -0.117849669   -0.11318144      0.076470005
## density              -0.368372087    0.66804729      0.022026232
## pH                    0.136005328   -0.68297819      0.234937294
## sulphates            -0.125306999    0.18300566     -0.260986685
## alcohol               0.245122841   -0.06166827     -0.202288027
## quality               0.066452608    0.12405165     -0.390557780
##                      citric.acid residual.sugar    chlorides
## X                    -0.15355136   -0.031260835 -0.119868519
## fixed.acidity         0.67170343    0.114776724  0.093705186
## volatile.acidity     -0.55249568    0.001917882  0.061297772
## citric.acid           1.00000000    0.143577162  0.203822914
## residual.sugar        0.14357716    1.000000000  0.055609535
## chlorides             0.20382291    0.055609535  1.000000000
## free.sulfur.dioxide  -0.06097813    0.187048995  0.005562147
## total.sulfur.dioxide  0.03553302    0.203027882  0.047400468
## density               0.36494718    0.355283371  0.200632327
## pH                   -0.54190414   -0.085652422 -0.265026131
## sulphates             0.31277004    0.005527121  0.371260481
## alcohol               0.10990325    0.042075437 -0.221140545
## quality               0.22637251    0.013731637 -0.128906560
##                      free.sulfur.dioxide total.sulfur.dioxide     density
## X                            0.090479643          -0.11784967 -0.36837209
## fixed.acidity               -0.153794193          -0.11318144  0.66804729
## volatile.acidity            -0.010503827           0.07647000  0.02202623
## citric.acid                 -0.060978129           0.03553302  0.36494718
## residual.sugar               0.187048995           0.20302788  0.35528337
## chlorides                    0.005562147           0.04740047  0.20063233
## free.sulfur.dioxide          1.000000000           0.66766645 -0.02194583
## total.sulfur.dioxide         0.667666450           1.00000000  0.07126948
## density                     -0.021945831           0.07126948  1.00000000
## pH                           0.070377499          -0.06649456 -0.34169933
## sulphates                    0.051657572           0.04294684  0.14850641
## alcohol                     -0.069408354          -0.20565394 -0.49617977
## quality                     -0.050656057          -0.18510029 -0.17491923
##                               pH    sulphates     alcohol     quality
## X                     0.13600533 -0.125306999  0.24512284  0.06645261
## fixed.acidity        -0.68297819  0.183005664 -0.06166827  0.12405165
## volatile.acidity      0.23493729 -0.260986685 -0.20228803 -0.39055778
## citric.acid          -0.54190414  0.312770044  0.10990325  0.22637251
## residual.sugar       -0.08565242  0.005527121  0.04207544  0.01373164
## chlorides            -0.26502613  0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.07037750  0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide -0.06649456  0.042946836 -0.20565394 -0.18510029
## density              -0.34169933  0.148506412 -0.49617977 -0.17491923
## pH                    1.00000000 -0.196647602  0.20563251 -0.05773139
## sulphates            -0.19664760  1.000000000  0.09359475  0.25139708
## alcohol               0.20563251  0.093594750  1.00000000  0.47616632
## quality              -0.05773139  0.251397079  0.47616632  1.00000000
## 
##  Pearson's product-moment correlation
## 
## data:  quality and fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516
## 
##  Pearson's product-moment correlation
## 
## data:  quality and volatile.acidity
## t = -16.9542, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578
## 
##  Pearson's product-moment correlation
## 
## data:  quality and citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725
## 
##  Pearson's product-moment correlation
## 
## data:  quality and log10(residual.sugar)
## t = 0.9407, df = 1597, p-value = 0.347
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.02551727  0.07247084
## sample estimates:
##        cor 
## 0.02353331
## 
##  Pearson's product-moment correlation
## 
## data:  quality and log10(chlorides)
## t = -7.1508, df = 1597, p-value = 1.308e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2232336 -0.1282260
## sample estimates:
##      cor 
## -0.17614
## 
##  Pearson's product-moment correlation
## 
## data:  quality and log10(free.sulfur.dioxide)
## t = -2.0041, df = 1597, p-value = 0.04522
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.098865884 -0.001068979
## sample estimates:
##         cor 
## -0.05008749
## 
##  Pearson's product-moment correlation
## 
## data:  quality and log10(total.sulfur.dioxide)
## t = -6.8999, df = 1597, p-value = 7.476e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2173510 -0.1221403
## sample estimates:
##        cor 
## -0.1701427
## 
##  Pearson's product-moment correlation
## 
## data:  quality and density
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2220365 -0.1269870
## sample estimates:
##        cor 
## -0.1749192
## 
##  Pearson's product-moment correlation
## 
## data:  quality and pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.106451268 -0.008734972
## sample estimates:
##         cor 
## -0.05773139
## 
##  Pearson's product-moment correlation
## 
## data:  quality and log10(sulphates)
## t = 12.9672, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2636092 0.3523323
## sample estimates:
##       cor 
## 0.3086419
## 
##  Pearson's product-moment correlation
## 
## data:  quality and alcohol
## t = 21.6395, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

The table shows the pairwise correlations between the features. Each Pearson test gives more detail on the relationship of quality to one feature, including the p-value and a 95 percent confidence interval, which helps support the confidence level of any conclusions about that relationship. Alcohol, sulphates, citric.acid, volatile.acidity, and total.sulfur.dioxide have the strongest correlations with quality, so I’ll use those features to examine the relationships in more detail.
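
Output of this kind comes from the base R functions cor and cor.test. The exact calls are not shown in this report, so the following is a sketch, assuming a data frame called redwine. The ten rows are the first values from the listing in the Univariate Plots Section, so the numbers it prints will differ from the full-data results above.

```r
# First ten rows of four columns from the listing (illustration only)
redwine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Pairwise Pearson correlations between all columns
cor_matrix <- cor(redwine)
round(cor_matrix, 2)

# Pearson test of quality against a single feature
alcohol_test <- cor.test(redwine$quality, redwine$alcohol)
alcohol_test$estimate  # correlation coefficient
alcohol_test$p.value   # p-value for the null hypothesis of zero correlation
alcohol_test$conf.int  # 95 percent confidence interval

# The same test on a log10-transformed feature
sulphates_test <- cor.test(redwine$quality, log10(redwine$sulphates))
sulphates_test$estimate
```

The degrees of freedom reported by cor.test are the number of rows minus 2, which is why the tests above show df = 1597 for 1599 wines.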

This graph is a matrix of plots within the redwine data set to show the correlation between features. The correlations are important, but the smaller graphs within the plot don’t show much due to the overlap of the data plotted. The correlation table is more valuable in terms of exploring the relationships (and easier to read due to so much information plotted).

Next, I’ll use boxplots to perform a more detailed statistical analysis on the main features compared to quality.

Using the quality factor, we can use a boxplot for each feature to show its relationship with quality. Boxplots are better than scatter plots in this situation for viewing trends, since quality is a factor and the other features are continuous numeric variables. They also show the trend of the median values across ratings, which is a good indicator of the direction of the relationship. Alcohol stays at a fairly consistent level across the lower quality ratings, then increases sharply, showing the impact of alcohol on quality. Sulphates show a similar upward trend, but contain a lot of outliers compared to other features, so it would be interesting to understand the source of the outliers if possible. Citric.acid shows a strong upward trend with quality, but the lower median values are skewed compared to other features. Volatile.acidity shows a negative trend, as expected. Lower levels of total.sulfur.dioxide are seen at both high and low quality ratings, while higher levels are seen at mid-level quality. Given that trend, total.sulfur.dioxide may not be as important a feature.

Next, I want to add mean to the same plots to explore any major differences with median.

This set of plots is the same as the boxplots above except that the mean is also plotted. Although the median is a better indicator, the mean can show the impact of outliers or the difference due to large ranges. For the most part, mean and median were close for each quality rating, with the exception of lower rated wines in the citric.acid plot.
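
The median and mean per quality rating can also be compared as numbers. The sketch below uses base R and assumes the redwine data frame and the qualityfactor variable described earlier. The ten rows are the first values from the listing in the Univariate Plots Section, and the plotting calls are an illustration, not the code that produced the plots in this report.

```r
# First ten rows of two columns from the listing (illustration only)
redwine <- data.frame(
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)
redwine$qualityfactor <- factor(redwine$quality)

# Median and mean of citric.acid within each quality rating
medians <- tapply(redwine$citric.acid, redwine$qualityfactor, median)
means   <- tapply(redwine$citric.acid, redwine$qualityfactor, mean)
rbind(median = medians, mean = means)

# Draw the boxplot and overlay the group means; only when run interactively
if (interactive()) {
  boxplot(citric.acid ~ qualityfactor, data = redwine,
          xlab = "quality", ylab = "citric.acid")
  # Boxes are drawn at x = 1, 2, ... in factor level order
  points(seq_along(means), means, pch = 4, col = "red")
}
```

A large gap between the two rows for a rating points to skew or outliers within that rating.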

Citric.acid and volatile.acidity have a negative relationship (correlation of about -0.55). As the citric.acid level decreases, the volatile.acidity level tends to increase. This matches the negative correlation found in the table above.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Quality has its strongest positive correlations with alcohol (0.48), sulphates (0.25), and citric.acid (0.23), and a negative correlation with volatile.acidity (-0.39). These relationships make sense based upon each attribute. Volatile.acidity is the amount of acetic acid in the wine, and a higher value means more of an unpleasant, vinegar taste. Citric.acid can add freshness and flavor to wines. Sulphates can help keep wine fresh. Total.sulfur.dioxide appears to have a weaker relationship, since low amounts are prevalent in both lower and higher quality wines whereas higher amounts exist in mid-quality wines. Total.sulfur.dioxide (SO2) becomes evident in the nose and taste of wine over 50 ppm, which may be why wines with higher levels are rated at the mid-level.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Quality also has a smaller correlation to fixed.acidity, chlorides, and density. Citric.acid has a strong positive relationship to fixed.acidity (0.67) and a negative relationship to volatile.acidity (-0.55). Since all of these are acids, they impact the pH level of the wine. Density is a result of the alcohol and sugar within the wine. Free.sulfur.dioxide has an impact on the total.sulfur.dioxide of the wine as well.

What was the strongest relationship you found?

The strongest relationship to quality is alcohol (0.48). Beyond quality, the strongest relationships were fixed.acidity to pH (-0.68), followed closely by fixed.acidity to citric.acid (0.67) and fixed.acidity to density (0.67).

Multivariate Plots Section

In this section, I will explore the relationship between two features and quality.

By faceting the data set, we are able to see the relationship between the specific features and quality more clearly. The faceting also shows the differences in the number of samples for each quality rating. Sulphates vs. alcohol shows that increased alcohol and sulphate levels lead to better wines (the points shift from lower left to mid-right). Citric.acid vs. alcohol shows a similar trend but has a few more data points lower in the graph than expected, especially at quality level 7. Volatile.acidity vs. alcohol shows a shift to the lower right due to the negative correlation. Sulphates vs. citric.acid shows that a slight increase in sulphate levels and an increase in citric.acid could lead to better wines, although there is some overlap with the other quality levels. Sulphates vs. volatile.acidity confirms the negative correlation of volatile.acidity with the majority of features explored. Volatile.acidity vs. citric.acid also displays the negative correlation between the two features. Lastly, exploring total.sulfur.dioxide vs. alcohol, higher total.sulfur.dioxide doesn’t appear to have a major impact on the quality unless it reaches a certain level.
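
The shift of the point cloud from one facet to the next can be summarized by the mean of each feature within each quality rating. The sketch below does this in base R and draws one panel per rating. It assumes a data frame called redwine, uses the first ten rows from the listing in the Univariate Plots Section only so that it runs on its own, and is not the plotting code used for the figures described above.

```r
# First ten rows of three columns from the listing (illustration only)
redwine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Mean alcohol and sulphates per quality rating (the center of each facet)
centers <- aggregate(cbind(alcohol, sulphates) ~ quality,
                     data = redwine, FUN = mean)
centers

# One scatter plot per quality rating on shared axes; only when interactive
if (interactive()) {
  groups <- split(redwine, redwine$quality)
  old_par <- par(mfrow = c(1, length(groups)))
  for (q in names(groups)) {
    plot(groups[[q]]$alcohol, groups[[q]]$sulphates,
         xlim = range(redwine$alcohol), ylim = range(redwine$sulphates),
         xlab = "alcohol", ylab = "sulphates", main = paste("quality", q))
  }
  par(old_par)  # restore the previous plotting layout
}
```

Sharing the axis limits across panels is what makes a shift toward the right or the top visible when comparing ratings.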

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Exploring alcohol, sulphates, citric.acid, volatile.acidity, and total.sulfur.dioxide levels and their impact on quality, I was able to show a relationship between quality and alcohol, acidity (higher citric.acid and lower volatile.acidity), and sulphates. Alcohol clearly has the largest impact, but sulphates and citric.acid also show an interesting relationship, since the higher quality wines were plotted in the upper right of the graphs for those features. On the other hand, due to the negative correlation, the higher quality wines in the volatile.acidity vs. alcohol graphs were plotted in the lower right.

Were there any interesting or surprising interactions between features?

The relationship between lower volatile.acidity and higher citric.acid is more apparent in the faceted plots. This supports the idea that citric.acid adds flavor and freshness while volatile.acidity negatively impacts the flavor.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

By plotting the sulphates of the wine against the alcohol content, we are clearly able to see the relationship between the two. As the alcohol content and sulphates increase, the quality of the wine also increases on a near linear scale. Compared to some of the other major features, sulphates probably have the weakest relationship to quality.

Plot Two

Description Two

We were able to determine that a strong correlation exists between alcohol and quality and a strong negative correlation exists between quality and volatile.acidity. By looking at the plot for each quality rating, as the quality increases, the points are closer to the lower right corner of the plots. This shows the relationship where a higher alcohol content and lower volatile.acidity produces a higher quality wine.

Plot Three

Description Three

These boxplots demonstrate the effect of citric.acid on the quality of wine, since the range between the 1st and 3rd quartiles moves upward as the quality increases. Plotting the mean along with the boxplot shows the increase of citric.acid along with the quality. We also see only a few outliers in the dataset. As a result, we are able to determine that the higher the citric.acid level in a wine, the better the quality rating tends to be.


Reflection

I was able to investigate the different features of the data set and perform an analysis to determine which had the greatest impact on quality. The features that factored into quality the most were alcohol content, sulphates, and acidity (citric.acid and volatile.acidity). The correlations and graphs illustrated the relationships between these features and the trends that resulted from increasing or decreasing the amount of each in wine. Although there may be some variation, the highest quality wines were higher in alcohol content, sulphates, and citric.acid while having a lower volatile.acidity. This resulted in the freshest, best tasting wines that were desired most by the experts rating the wines. The analysis could be enriched by performing a more in-depth comparison of the relationships between all of the different features. I only looked at the top 5 correlations, so a broader comparison could provide more information to support the conclusion. I also could have accounted for the entries of 0 for citric.acid (or other data quality issues). Overall, the analysis provided a great opportunity to explore the data set and reinforce the skills learned through the lessons.